Skip to content

perf(conservative): apply weights via scipy CSR matmul (robust without opt-einsum)#74

Draft
thodson-usgs wants to merge 1 commit into
xarray-contrib:mainfrom
thodson-usgs:perf/conservative-csr-apply
Draft

perf(conservative): apply weights via scipy CSR matmul (robust without opt-einsum)#74
thodson-usgs wants to merge 1 commit into
xarray-contrib:mainfrom
thodson-usgs:perf/conservative-csr-apply

Conversation

@thodson-usgs
Copy link
Copy Markdown

Summary

The axis-factored conservative regridder applied its (sparse) per-axis weights with a single multi-operand xr.dot(data, W_lat, W_lon, optimize=True). That contraction only finds an efficient separable path when opt_einsum is installed. Installing the conservative-2d extra (which brings sparse) without accel (which brings opt-einsum) hits the slow path: the sparse multi-operand einsum is 20–80× slower (13–29 s for a 48-step 0.5° field) and returns a sparse.COO result — a wasteful container for a dense field.

This applies each per-axis weight with a scipy CSR sparse-dense matmul instead (_csr_apply_axis, via apply_ufunc so dask / output_chunks / skipna are preserved). One unified path — the weight is compressed to CSR whether stored as sparse.COO or dense. Separability makes the sequential per-axis contraction identical to the fused one. It is fast with or without opt_einsum, keeps the weights compressed (the (n_src, n_dst) matrix is never densified — it reaches hundreds of MB–GB at high resolution), and yields a dense result. scipy is already a base dependency, so no new requirement.

Benchmark (no opt_einsum, eager, ntime=48)

grid before (sparse xr.dot) after (CSR) speedup
1° → 2.5°, skipna False / True 3210 / 7137 ms 62 / 94 ms ~52–76×
0.5° → 1°, skipna False / True 13260 / 29309 ms 223 / 371 ms ~59–79×

With opt_einsum present the old path was already fast; CSR matches it. Numerical results are unchanged (sequential vs fused contraction differs only at ~1e-15).

Tests

  • test_conservative_conserves_known_integral — gold-standard known-integral conservation (cos²(lat)·(1.5+sin(lon)) → 4π, conserved to ~4e-7 measured with independent analytic spherical cell areas).
  • test_conservative_returns_dense_output — regression guard that the result is a dense ndarray.
  • The full existing suite (CDO comparisons, NaN handling, dask chunking, attrs) passes; verified identical results (max abs diff 0.0) on both the sparse-installed and no-sparse code paths.

Note

This effectively supersedes #49's sparse weights for the axis-factored method's apply step — the weights are still stored compressed; they are just multiplied via CSR instead of a sparse einsum. utils.overlap still builds the weight dense transiently during the build; making that sparse end-to-end is a separate, deeper optimization left for later.

🤖 Generated with Claude Code

…r without opt_einsum)

The axis-factored conservative method stored its weights as `sparse.COO` (xarray-contrib#49)
and applied them with one multi-operand `xr.dot(data, W_lat, W_lon, optimize=True)`.
That contraction only finds an efficient separable path when `opt_einsum` is
installed; without it -- e.g. installing the `conservative-2d` extra (which brings
`sparse`) but not `accel` (which brings `opt-einsum`) -- the sparse multi-operand
einsum is catastrophic (13-29 s for a 48-step 0.5deg field) and the result comes
back `sparse.COO`, a wasteful container for a dense field.

Apply each per-axis weight with a scipy CSR sparse-dense matmul instead
(`_csr_apply_axis`, via `apply_ufunc` so dask / output_chunks / skipna are
preserved). One unified path: the weight is compressed to CSR whether stored as
`sparse.COO` or dense. Separability makes the sequential per-axis contraction
identical to the fused one. This is fast and robust whether or not opt_einsum is
present, keeps the weights compressed (the (n_src, n_dst) matrix is never
densified -- it reaches hundreds of MB to GB at high resolution), and yields a
dense result. scipy is already a base dependency, so no new requirement.

Speedups without opt_einsum: ~56x at 1deg, ~60-80x at 0.5deg. Numerical results
unchanged (sequential vs fused contraction differs only at the ~1e-15 fp level).

Tests (tests/test_regrid.py):
 - test_conservative_conserves_known_integral: gold-standard known-integral
   conservation -- cos^2(lat)*(1.5+sin(lon)) -> 4*pi, conserved to ~4e-7 measured
   with independent analytic spherical cell areas.
 - test_conservative_returns_dense_output: regression guard that the result is a
   dense ndarray, not sparse.COO.

mypy: add a scipy.* override (scipy ships no stubs) and annotate the new returns;
error count unchanged from baseline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant